books. Based on this model, a mobile application for extracting text content 
from images in Vietnamese textbooks was built using OpenCV and the Canny 
edge detection algorithm. There are 178 character classes in Vietnamese with 
accents. However, within the scope of Vietnamese character recognition in 
textbooks, some classes of characters differ only in actual size, such 
as “c” and “C”, “o” and “O”. Therefore, the authors built the classification 
model for 138 Vietnamese character classes after filtering out similar character 
1. INTRODUCTION 

Text recognition is the process of converting the text in one or more images into documents such as 
those written on a computer [1-3]. Word recognition has been studied and is widely used to automate office 
operations in many languages, such as English [4, 5], Chinese [6, 7], and Japanese [8]. In Vietnam, there are 
recognition systems such as the VietOCR software [9] and VNDOCR [10], with accuracy of about 90-95% 
[9, 10]. More generally, Vietnamese text recognition has attracted many researchers, who have used a range 
of models and techniques: HANDS-VNOnDB [11]; RetinaNet for text detection with an Inception-v3 CNN 
for feature extraction, followed by a bidirectional long short-term memory (Bi-LSTM)-based RNN for text 
recognition [12]; Bi-LSTM combined with a Conditional Random Field (CRF) [13]; SSD Mobilenet V2 for 
text detection with Attention OCR (AOCR) for text recognition [14]; and CNNs, which are well known in 
image processing applications [15-18], integrated with grammatical features for emotion detection [16]. 

Unlike the works mentioned above, this work focuses on printed word recognition in Vietnamese 
educational textbooks with a reduced character set: 138 character classes instead of 178, since some 
characters differ only in size. 


Journal homepage: http://beei.org 


Bulletin of Electr Eng & Inf ISSN: 2302-9285 o 963 


2. RESEARCH METHOD 
2.1. Overview model for Vietnamese character recognition based on CNN 

Figure 1 shows an overview of the proposed CNN-based reduced-character-class model. The raw data 
are pictures of textbook pages taken by phone; each picture is a color image capturing the entire text of a 
page. To make the data usable, the images are pre-processed to form two data sets for the CNN model 
(training data and testing data). 

The CNN model's input is the pre-processed data (in this case, gray images of accented characters 
with a fixed size of 28x28 pixels), and the output is the images' corresponding labels. Note that the chosen 
image size matters for pattern recognition, because it fixes the dimension of the network's input. There are 
178 character classes in Vietnamese, including accented characters, but an analysis of words in grade 1 
Vietnamese textbooks shows that some classes differ only in actual size, for example, “c” and “C”, “o” and 
“O”. Therefore, the authors built the recognition model for only 138 character classes after filtering out such 
similar characters, in order to increase the effectiveness of the developed model. 
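
The class reduction described above can be pictured as a label-canonicalization step. The following Python sketch is illustrative only: the set `SAME_SHAPE_BASES` holds just the two example letters named in the text ("c" and "o"), and the function name `canonical_label` is hypothetical. The full list of classes merged to go from 178 to 138 is not given here.

```python
import unicodedata

# Base letters whose upper- and lowercase glyphs differ only in size.
# Assumption: only the two examples from the text are listed; the complete
# set used by the authors is not specified.
SAME_SHAPE_BASES = {"c", "o"}


def canonical_label(ch: str) -> str:
    """Map a character to the class label it would share after merging."""
    # NFD splits an accented letter into its base letter plus combining
    # marks, so the first code point is the unaccented base letter.
    base = unicodedata.normalize("NFD", ch)[0].lower()
    if base in SAME_SHAPE_BASES:
        return ch.lower()  # fold the case, keep the accents
    return ch


if __name__ == "__main__":
    for ch in ["C", "c", "\u00d2", "A"]:
        print(repr(ch), "->", repr(canonical_label(ch)))
```

With this mapping, "C" and "c" share one label, while a letter such as "A", whose two cases have different shapes, keeps separate classes.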
Figure 1. Overview of CNN-based reduced-character-class model for text recognition 


2.2. Pre-processing data 
Given an input image taken by phone (RGB color system), the authors performed the following 
pre-processing steps: converting the image to grayscale and binary images, finding the contour of each token 
in the image, finding the contour of each character in each token, and cutting out and saving each character. 
These pre-processing steps were implemented in Java with the OpenCV library. Each step is detailed 
below. 
a. Step 1: Converting to grayscale and binary images 
OpenCV [19-21] is an open-source library built for image processing. The library provides functions that 
convert color images to grayscale and to binary images based on the Otsu algorithm [22]. Unlike [23], 
which used a localization thresholding technique, this work used the existing functions from the library. 
b. Step 2: Finding the contour of each token 
To find the contour of each token, the authors built a function that performs the following: 
- Perform a closing operation to connect the characters in a token seamlessly, so that each token forms a 
block. The original binary image is retained, because it is needed to cut out each character in the next 
step. 
- Next, apply the Canny [24-28] algorithm to find the contour of each word in the image. 
- The image contains not only text but also other physical objects, so the boundaries found include 
those objects. These objects are retained, but they are ignored in the identification step. 
c. Step 3: Finding the contour of each character 
A function was built to accomplish this task: 
- The input of the function is images of Vietnamese tokens (including the boundary connections), and the 
output is images of the characters, including accents. 
This step again applies the Canny algorithm, this time to a single-token binary image, to find the contours 
of the character images in the token. The boundaries found at this point are those of characters without 
accents, and there may be additional images of individual accents (for example, 'ế' has three (3) contours: 
the contour of the letter 'e', the '^', and the acute accent). To obtain the contour of a fully accented character 
image, the rectangular blocks containing these elements are rearranged so that the elements lying along 
the same y-axis are combined into an image of the character being sought. 
d. Step 4: Cutting and saving the gray image of each character with its accent, if any 
From the above steps, the area of every character, including its accent, is found. The algorithm then 
cuts each character from the grayscale image, resizes it to 28x28 pixels to get the complete Vietnamese 
character image, and saves it as a .jpg file. Figure 2 presents some of the results from the processed 
image. 
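
Step 1 relies on the Otsu thresholding function built into OpenCV. To make the idea concrete, the following Python sketch implements Otsu's method from scratch on a flat list of 8-bit gray values. It illustrates the technique only and is not the authors' Java code; the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    `pixels` is a flat, non-empty sequence of integers in the range 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # lower class still empty
        if w1 == 0:
            break     # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to the constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold):
    """Map a 2-D grayscale image (list of rows) to values 0 and 255."""
    return [[255 if p > threshold else 0 for p in row] for row in gray]
```

For a page with dark print on light paper, the gray values form two clusters, and the returned threshold falls between them.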
Figure 2. Image processing results 
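
The recombination of accents with their base letters in Step 3 can be sketched as a merge of bounding boxes. The Python sketch below assumes that each contour has already been reduced to a box `(x, y, width, height)` and that an accent belongs to the letter whose horizontal extent it overlaps. The overlap rule, the 0.5 ratio, and all function names are assumptions made for illustration, since the exact rule used by the authors is not stated.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), origin at top-left


def horizontal_overlap(a: Box, b: Box) -> int:
    """Width in pixels shared by the x-ranges of two boxes."""
    left = max(a[0], b[0])
    right = min(a[0] + a[2], b[0] + b[2])
    return max(0, right - left)


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    x0 = min(a[0], b[0])
    y0 = min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def merge_accent_boxes(boxes: List[Box], min_ratio: float = 0.5) -> List[Box]:
    """Merge boxes that are stacked vertically into one box per character.

    Boxes are visited from left to right. A box is merged into the previous
    group when their horizontal overlap covers at least `min_ratio` of the
    narrower of the two boxes (assumed criterion).
    """
    merged: List[Box] = []
    for box in sorted(boxes):
        if merged:
            last = merged[-1]
            narrower = min(last[2], box[2])
            if narrower > 0 and horizontal_overlap(last, box) >= min_ratio * narrower:
                merged[-1] = union(last, box)
                continue
        merged.append(box)
    return merged


if __name__ == "__main__":
    # Letter body, circumflex and acute accent of one character, then a
    # neighbouring letter without accents.
    contours = [(10, 20, 12, 14), (11, 12, 10, 5), (16, 5, 5, 5), (26, 20, 12, 14)]
    print(merge_accent_boxes(contours))
```

Each merged box can then be used to cut the complete accented character out of the grayscale image, as in Step 4.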


2.3. Building training and testing data 

From the raw data of 2,000 photos of Vietnamese textbooks, a dataset of 86,544 samples was obtained 
and labelled manually after pre-processing. This data set is then divided into two parts: testing and training. 
The testing dataset consists of 50 images per character class, for a total of 6,900 samples, while the training 
dataset consists of the remaining 79,644 samples, as detailed in Table 1. Figure 3 presents some images of 
the “O” character class retrieved from the training dataset: 
Table 1. Training dataset statistics 

No.   Char   Samples 
47    o      524 
48    Ò      526 
51    6      531 
61    0      511 
68    S      548 
69    t      539 
Figure 3. Training data of “Q” character class 
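
The hold-out of 50 images per class described in this section can be sketched as follows. Whether the authors drew the test images at random is not stated, so the random selection with a fixed seed, the `(item, label)` sample format, and the function name `split_per_class` are assumptions for illustration.

```python
import random
from collections import defaultdict


def split_per_class(samples, n_test=50, seed=0):
    """Split (item, label) pairs into a training list and a testing list.

    For every label, `n_test` items go to the testing list and the rest to
    the training list. A class with fewer than `n_test` items ends up
    entirely in the testing list.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        test.extend((item, label) for item in items[:n_test])
        train.extend((item, label) for item in items[n_test:])
    return train, test


if __name__ == "__main__":
    # Sizes reported in the text: 138 classes, 50 test images per class.
    n_test_total = 138 * 50
    print(n_test_total, 86544 - n_test_total)
```

With the figures given in the text, 138 classes at 50 images each give 6,900 testing samples, which leaves 86,544 - 6,900 = 79,644 training samples.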


2.4. Building a CNN model for Vietnamese printed character recognition and evaluating identification 
results 
The proposed CNN model consists of four main layer types: convolution, ReLU, pooling, and 
fully-connected layers. The proposed model is presented in Figure 4. 
Figure 4. CNN model recognizing Vietnamese characters 


Based on the above figure, a CNN model was built. The model has an input layer that takes grey 
images of 28x28 pixels, followed by 2 convolution layers, each followed by a max pooling layer. 
Next are 2 fully-connected layers. Finally, there is an output layer with 138 outputs. 

In Convolution layer 1, the rows and columns of the input image are padded with 0s. Because of this, 
the output of the convolution keeps the input size of 28x28 pixels. In this work, 32 randomly initialized 
filters are used to obtain 32 output channels. After convolution, the next layer is max pooling with a size of 
2x2 and a step size of 2x2. After this step, the dimension of the image is 14x14. 
In the second Convolution layer, the image from the first layer is likewise padded with rows and columns 
of 0s to keep its dimension at 14x14. This layer gives 64 output channels. After convolution, the image is 
passed through a max pooling layer with a size of 2x2 and a step size of 2x2, after which the dimension of 
the image is 7x7. After the two convolution and max pooling stages, we get a layer of 7x7x64 = 3,136 
neurons. Next, we add another fully-connected layer of 1,000 neurons, and finally, the output layer consists 
of 138 outputs. 
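
The layer dimensions given above can be checked with a short shape calculation. The sketch below is plain Python and uses hypothetical helper names (`conv_same`, `max_pool`, `dense_params`). The parameter counts of the fully-connected layers assume one bias term per neuron, which is not stated in the text.

```python
def conv_same(shape, filters):
    """Zero-padded convolution with stride 1: height and width are kept."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, size=2, stride=2):
    """Max pooling without padding."""
    height, width, channels = shape
    return ((height - size) // stride + 1, (width - size) // stride + 1, channels)


def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer (bias assumed)."""
    return n_in * n_out + n_out


shape = (28, 28, 1)            # gray input image
shape = conv_same(shape, 32)   # (28, 28, 32)
shape = max_pool(shape)        # (14, 14, 32)
shape = conv_same(shape, 64)   # (14, 14, 64)
shape = max_pool(shape)        # (7, 7, 64)

flattened = shape[0] * shape[1] * shape[2]   # 7 * 7 * 64 = 3136
hidden_params = dense_params(flattened, 1000)
output_params = dense_params(1000, 138)

print(shape, flattened, hidden_params, output_params)
```

Under these assumptions, most of the trainable parameters sit in the connection between the 3,136 flattened values and the 1,000-neuron layer.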

The above model is built with the Keras library [29] in Python, as shown in Figure 5. 
Figure 5. CNN model built on Keras library [29-31] 
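
A model with the layer sizes described in this section could be written in Keras roughly as follows. This is a sketch under stated assumptions, not the authors' code from Figure 5: the 3x3 kernel size, the ReLU activation on the hidden fully-connected layer, the softmax output, the Adam optimizer, and the sparse categorical cross-entropy loss are all assumptions, and `build_model` is a hypothetical name.

```python
INPUT_SHAPE = (28, 28, 1)   # 28x28 gray image, one channel
NUM_CLASSES = 138           # reduced set of character classes
KERNEL_SIZE = (3, 3)        # assumption: kernel size is not given in the text


def build_model():
    """Build and compile a CNN with the layer sizes described in the text."""
    # Imported here so that the constants above can be used without
    # TensorFlow being installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        # padding="same" zero-pads the input so the 28x28 size is kept.
        layers.Conv2D(32, KERNEL_SIZE, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),   # -> 14x14x32
        layers.Conv2D(64, KERNEL_SIZE, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),   # -> 7x7x64
        layers.Flatten(),                                        # -> 3136
        layers.Dense(1000, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",                          # assumed optimizer
        loss="sparse_categorical_crossentropy",    # assumes integer labels
        metrics=["accuracy"],
    )
    return model
```

Calling `build_model().summary()` prints the output shape of each layer, which can be compared with the dimensions stated in the text.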


3. RESULTS AND DISCUSSION 
3.1. CNN model training and evaluation results 
With the training and testing data sets prepared, the authors trained the model in 2 cases: 
- Case 1: Keeping the entire testing set of 138 * 50 = 6,900 samples, training uses 200 samples per 
character class, for a total of 138 * 200 = 27,600 samples. After a training time of about 2 hours, we get 
the results in Figures 6 and 7. 


Figure 6. Loss function chart (training and validation loss) with a training set of 200 samples per class 
Figure 7. Accuracy chart (training and validation accuracy) with a training set of 200 samples per class 


In case 1, the accuracy on the training set improves over time and the value of the loss function 
approaches zero. However, the model had low and unstable accuracy on the testing data, since the loss was 
still large. This is the main reason why the authors conducted further training with a larger dataset. In case 2, 
training with the full training dataset gave better results and greater accuracy on the testing dataset. After 
the two training runs, the authors found that 50 training epochs per run were sufficient, because the accuracy 
showed almost no significant change in the final epochs and the loss function reached a sufficiently small 
value. The results of the two cases are summarized in Table 2. 
Figure 8. Loss function diagram with complete data set 
Figure 9. Accuracy chart with complete data set 


Table 2. CNN model training results 

Training run   No. of training samples   No. of testing samples   Accuracy 
1              27,600                    6,900                    91.23% 
2              79,644                    6,900                    97.14% 
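
To put the two accuracy figures in terms of individual samples, the reported percentages can be converted back into approximate counts on the 6,900-sample testing set. The short Python sketch below does this. The counts are estimates, because the published accuracies are rounded to two decimals, and the function name `estimated_correct` is hypothetical.

```python
def estimated_correct(accuracy_pct: float, n_test: int) -> int:
    """Approximate number of correctly classified testing samples."""
    return round(accuracy_pct / 100 * n_test)


N_TEST = 6900
reported = {"case 1": 91.23, "case 2": 97.14}   # accuracies from Table 2

for name, accuracy in reported.items():
    correct = estimated_correct(accuracy, N_TEST)
    wrong = N_TEST - correct
    print(f"{name}: about {correct} correct, about {wrong} misclassified")
```

Under this estimate, training on the full dataset reduces the number of misclassified testing samples from roughly 600 to roughly 200.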


3.2. Token identification in Vietnamese textbooks based on the developed CNN model 

A demo mobile application was built on the Android platform to test the accuracy and practicality of 
the developed CNN model. The application's input is an image of a textbook page, and the output is a 
sequence of tokens in text form. The application takes a picture, performs the pre-processing steps 
outlined in Section 2, and finally uses the trained CNN model to produce the main result: the text string of 
the textbook page, which has potential for use in an auto-reading application. Some demo images of the 
application are shown in Figure 10. 
Figure 10. Images of Vietnamese token recognition in the Android-based demo application developed based 
on the built CNN model 
The photographs had some issues with brightness and shooting angle. These made the pre-processing 
steps difficult for the application and therefore degraded text recognition. It is also tricky to distinguish 
uppercase from lowercase letters, because some letters have similar shapes in uppercase and lowercase. 


4. CONCLUSION 

Through research and experiment, the authors achieved the following main results: a detailed study of 
the deep learning model and the construction of a reduced-character-class CNN model for Vietnamese 
printed character recognition. The CNN model's training and testing datasets were built from 2,000 raw 
images and comprise 79,644 training samples and 6,900 testing samples. This work demonstrated that the 
developed CNN model is useful in image recognition, since it reached 97.14% accuracy on the testing set. 
The developed model can be used in Vietnamese spelling applications for learning Vietnamese. The authors 
also built and installed the CNN model on a mobile device to address Vietnamese token identification in 
textbooks for learning and spelling Vietnamese. 